Before embarking on developing statistical models and generating predictions, it is essential to understand your data. This is typically done using conventional numerical and graphical methods. John Tukey (Tukey, 1977) advocated the practice of exploratory data analysis (EDA) as a critical part of the scientific process.
“No catalog of techniques can convey a willingness to look for what can be seen, whether or not anticipated. Yet this is at the heart of exploratory data analysis. The graph paper and transparencies are there, not as a technique, but rather as a recognition that the picture examining eye is the best finder we have of the wholly unanticipated.”
Fortunately, we can dispense with the graph paper and transparencies and use software that makes routine work of developing the ‘pictures’ (i.e., graphical output) and descriptive statistics needed to explore our data.
Descriptive statistics include:

- Mean
- Median
- Mode
- Standard deviation
- Quantiles

Graphical methods include:

- Bar plots
- Histograms
- Density plots
- Quantile-quantile plots
- Box-whisker plots
- Scatter plots
Before you start an EDA, you should inspect your data and correct all typos and blatant errors. EDA can then be used to identify additional errors, such as outliers, and help you determine the appropriate statistical analyses. For this chapter we’ll use the loafercreek dataset from the CA630 Soil Survey Area.
library(aqp)
library(soilDB)
# Load the loafercreek dataset
data("loafercreek")
# Construct generalized horizon designations
n <- c("A",
"BAt",
"Bt1",
"Bt2",
"Cr",
"R")
# REGEX rules
p <- c("A",
"BA|AB",
"Bt|Bw",
"Bt3|Bt4|2B|C",
"Cr",
"R")
# Compute genhz labels and add to loafercreek dataset
loafercreek$genhz <- generalize.hz(loafercreek$hzname, n, p)
# Extract the horizon table
h <- horizons(loafercreek)
# Examine the pairing of genhz labels with the original hznames
table(h$genhz, h$hzname)
##
## 2BCt 2Bt1 2Bt2 2Bt3 2Bt4 2Bt5 2CB 2CBt 2Cr 2Crt 2R A A1 A2 AB ABt
## A 0 0 0 0 0 0 0 0 0 0 0 97 7 7 0 0
## BAt 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
## Bt1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2
## Bt2 1 2 7 8 6 1 1 1 0 0 0 0 0 0 0 0
## Cr 0 0 0 0 0 0 0 0 4 2 0 0 0 0 0 0
## R 0 0 0 0 0 0 0 0 0 0 5 0 0 0 0 0
## not-used 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##
## Ad Ap B BA BAt BC BCt Bt Bt1 Bt2 Bt3 Bt4 Bw Bw1 Bw2 Bw3 C CBt Cd
## A 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## BAt 0 0 0 31 8 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## Bt1 0 0 0 0 0 0 0 8 94 89 0 0 10 2 2 1 0 0 0
## Bt2 0 0 0 0 0 5 16 0 0 0 47 8 0 0 0 0 6 6 1
## Cr 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## R 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## not-used 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##
## Cr Cr/R Crt H1 Oi R Rt
## A 0 0 0 0 0 0 0
## BAt 0 0 0 0 0 0 0
## Bt1 0 0 0 0 0 0 0
## Bt2 0 0 0 0 0 0 0
## Cr 49 0 20 0 0 0 0
## R 0 1 0 0 0 41 1
## not-used 0 0 0 1 24 0 0
As noted in Chapter 1, a visual examination of the raw data is possible by clicking on the dataset in the Environment tab, or via the command line:
View(h)
This view is fine for a small dataset, but can be cumbersome for larger ones. The summary() function can be used to quickly summarize a dataset; however, even for our small example dataset, the output can be voluminous. Therefore, in the interest of saving space, we’ll only look at a sample of columns.
vars <- c("genhz", "clay", "total_frags_pct", "phfield", "effclass")
summary(h[, vars])
## genhz clay total_frags_pct phfield
## A :113 Min. :10.00 Min. : 0.00 Min. :4.90
## BAt : 40 1st Qu.:18.00 1st Qu.: 0.00 1st Qu.:6.00
## Bt1 :208 Median :22.00 Median : 5.00 Median :6.30
## Bt2 :116 Mean :23.67 Mean :14.18 Mean :6.18
## Cr : 75 3rd Qu.:28.00 3rd Qu.:20.00 3rd Qu.:6.50
## R : 48 Max. :60.00 Max. :95.00 Max. :7.00
## not-used: 26 NA's :173 NA's :381
## effclass
## very slight: 0
## slight : 0
## strong : 0
## violent : 0
## none : 86
## NA's :540
##
The summary() function is known as a generic R function: it returns a preprogrammed summary for any R object. Because h is a data frame, we get a summary of each column. Factors are summarized by their frequency (i.e., number of observations), numeric or integer variables get a five number summary, and characters simply report their length. The number of missing observations (NA’s) is also printed for any variable that has them. If any of these metrics look unfamiliar, don’t worry; we’ll cover them shortly.
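Generic dispatch is easy to see by comparing classes; a small sketch with made-up values (not the loafercreek data):

```r
# summary() dispatches on the class of its argument:
# factors yield counts per level, numeric vectors a five number summary
x <- factor(c("A", "Bt", "Bt", "Cr"))
summary(x)

y <- c(10, 18, 22, 28, 60)
summary(y)
```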
When you do have missing data and the function you want to run will not run with missing values, the following options are available:

- Exclude all rows with missing values, such as h2 <- na.exclude(h). However, this can be wasteful because it removes entire rows (e.g., horizons) even if the row has only one missing value. It is sometimes better to create a temporary copy of the variable in question and remove its missing values, such as clay <- na.exclude(h$clay).
- Replace the missing values, such as h$clay <- ifelse(is.na(h$clay), 0, h$clay) or h[is.na(h$clay), ] <- 0.
- Use the na.rm argument available in many functions, such as mean(h$clay, na.rm = TRUE).

A quick check for typos is to examine the list of levels for a factor or character, such as:
# just for factors
levels(h$genhz)
## [1] "A" "BAt" "Bt1" "Bt2" "Cr" "R" "not-used"
# for characters and factors
sort(unique(h$hzname))
## [1] "2BCt" "2Bt1" "2Bt2" "2Bt3" "2Bt4" "2Bt5" "2CB" "2CBt" "2Cr" "2Crt"
## [11] "2R" "A" "A1" "A2" "AB" "ABt" "Ad" "Ap" "B" "BA"
## [21] "BAt" "BC" "BCt" "Bt" "Bt1" "Bt2" "Bt3" "Bt4" "Bw" "Bw1"
## [31] "Bw2" "Bw3" "C" "CBt" "Cd" "Cr" "Cr/R" "Crt" "H1" "Oi"
## [41] "R" "Rt"
If the unique() function returned typos such as “BT” or “B t”, you could either fix your original dataset or you could make an adjustment in R, such as:
h$hzname <- ifelse(h$hzname == "BT", "Bt", h$hzname)
# or
h$hzname[h$hzname == "BT"] <- "Bt"
# or as a last resort we could manually edit the spreadsheet in R
edit(h)
Typo errors such as these are a common problem with old pedon data in NASIS.
# gopheridge rules
n <- c('A', 'Bt1', 'Bt2', 'Bt3','Cr','R')
p <- c('^A|BA$', 'Bt1|Bw','Bt$|Bt2', 'Bt3|CBt$|BCt','Cr','R')
| Parameter | NASIS | Description | R function |
|---|---|---|---|
| Mean | RV ? | arithmetic average | mean() |
| Median | RV | middle value, 50% quantile | median() |
| Mode | RV | most frequent value | sort(table(), decreasing = TRUE)[1] |
| Standard Deviation | L & H ? | variation around mean | sd() |
| Quantiles | L & H | value below which a given proportion (p) of observations fall | quantile() |
These measures are used to determine the mid-point of the range of observed values. In NASIS speak this should ideally be equivalent to the representative value (RV) for numeric and integer data. The mean and median are the most commonly used measures for our purposes.
Mean - is the arithmetic average all are familiar with, formally expressed as: \(\bar{x} =\frac{\sum_{i=1}^{n}x_i}{n}\), which sums ( \(\sum\) ) all of the x values in the sample and divides by the number of samples (n). Throughout this document, formulas refer to samples rather than populations.
The mean clay content from the loafercreek dataset may be determined:
# first remove missing values and create a new vector
clay <- na.exclude(h$clay)
mean(clay)
## [1] 23.6713
# or use the additional na.rm argument
mean(h$clay, na.rm = TRUE)
## [1] 23.6713
Median is the middle measurement of a sample set, and as such is a more robust estimate of central tendency than the mean. The median is also known as the 50th percentile, meaning there are an equal number of samples with values less than and greater than it. For example, given 21 samples sorted in ascending order, the median would be the 11th sample.
The median from the sample dataset may be determined:
median(clay)
## [1] 22
Mode - is the most frequent measurement in the sample. The use of mode is typically reserved for factors, which we will discuss shortly. One issue with using the mode for numeric data is that the data need to be rounded to the level of desired precision. R does not include a function for calculating the mode, but we can calculate it using the following example.
sort(table(round(h$clay)), decreasing = TRUE)[1] # sort and select the 1st value, which will be the mode
## 25
## 41
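Reading the output above: the mode is 25 percent clay, which occurs 41 times (the table name is the value; the number beneath it is the count). The same idea can be wrapped in a small helper; a sketch, and note that Mode is our own hypothetical name, not a base R function:

```r
# hypothetical helper: most frequent (rounded) value in a numeric vector
Mode <- function(x, na.rm = TRUE) {
  if (na.rm) x <- x[!is.na(x)]
  tab <- sort(table(round(x)), decreasing = TRUE)
  as.numeric(names(tab)[1])
}

Mode(c(10, 25, 25, 30))  # 25
```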
Frequencies
To summarize factors and characters we can examine their frequency or number of observations. This is accomplished using the table() or xtabs() functions.
table(h$genhz)
##
## A BAt Bt1 Bt2 Cr R not-used
## 113 40 208 116 75 48 26
# or
# summary(h$genhz)
This gives us a count of the number of observations for each horizon. If we want to see the comparison between two different factors or characters, we can include two variables.
table(h$genhz, h$texcl)
| cos | s | fs | vfs | lcos | ls | lfs | lvfs | cosl | sl | fsl | vfsl | l | sil | si | scl | cl | sicl | sc | sic | c | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6 | 0 | 0 | 78 | 27 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| BAt | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 31 | 4 | 0 | 0 | 2 | 1 | 0 | 0 | 0 |
| Bt1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 125 | 20 | 0 | 4 | 46 | 5 | 0 | 1 | 2 |
| Bt2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 28 | 5 | 0 | 5 | 52 | 3 | 0 | 1 | 16 |
| Cr | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| R | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| not-used | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
We can also easily add margin totals to the table or convert the table frequencies to proportions.
# append the table with row and column sums
addmargins(table(h$genhz, h$texcl))
# calculate proportions relative to the rows:
# margin = 1 computes within rows, margin = 2 within columns,
# margin = NULL over all observations
round(prop.table(table(h$genhz, h$texture_class), margin = 1) * 100)
knitr::kable(addmargins(table(h$genhz, h$texcl)))
| cos | s | fs | vfs | lcos | ls | lfs | lvfs | cosl | sl | fsl | vfsl | l | sil | si | scl | cl | sicl | sc | sic | c | Sum | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6 | 0 | 0 | 78 | 27 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 111 |
| BAt | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 31 | 4 | 0 | 0 | 2 | 1 | 0 | 0 | 0 | 39 |
| Bt1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 125 | 20 | 0 | 4 | 46 | 5 | 0 | 1 | 2 | 204 |
| Bt2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 28 | 5 | 0 | 5 | 52 | 3 | 0 | 1 | 16 | 110 |
| Cr | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| R | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| not-used | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| Sum | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 8 | 0 | 0 | 262 | 56 | 0 | 9 | 101 | 9 | 0 | 2 | 19 | 466 |
knitr::kable(round(prop.table(table(h$genhz, h$texture_class), margin = 1) * 100))
| br | c | cb | cl | gr | l | pg | scl | sic | sicl | sil | sl | spm | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A | 0 | 0 | 0 | 0 | 0 | 70 | 0 | 0 | 0 | 0 | 24 | 5 | 0 |
| BAt | 0 | 0 | 0 | 5 | 0 | 79 | 0 | 0 | 0 | 3 | 10 | 3 | 0 |
| Bt1 | 0 | 1 | 0 | 23 | 0 | 61 | 0 | 2 | 0 | 2 | 10 | 0 | 0 |
| Bt2 | 0 | 14 | 1 | 46 | 2 | 25 | 1 | 4 | 1 | 3 | 4 | 0 | 0 |
| Cr | 98 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| R | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| not-used | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 96 |
To determine the mean by a group or category, use the aggregate() function:
aggregate(clay ~ genhz, data = h, mean)
## genhz clay
## 1 A 16.23113
## 2 BAt 19.53889
## 3 Bt1 24.14221
## 4 Bt2 31.35045
## 5 Cr 15.00000
To determine the median by group or category, use aggregate() again:
aggregate(clay ~ genhz, data = h, median)
## genhz clay
## 1 A 16.0
## 2 BAt 19.5
## 3 Bt1 24.0
## 4 Bt2 30.0
## 5 Cr 15.0
# or we could use the summary() function to get both the mean and median
aggregate(clay ~ genhz, data = h, summary)
## genhz clay.Min. clay.1st Qu. clay.Median clay.Mean clay.3rd Qu. clay.Max.
## 1 A 10.00000 14.00000 16.00000 16.23113 18.00000 25.00000
## 2 BAt 14.00000 17.00000 19.50000 19.53889 20.00000 28.00000
## 3 Bt1 12.00000 20.00000 24.00000 24.14221 28.00000 51.40000
## 4 Bt2 10.00000 26.00000 30.00000 31.35045 35.00000 60.00000
## 5 Cr 15.00000 15.00000 15.00000 15.00000 15.00000 15.00000
These are measures used to determine the spread of values around the mid-point. This is useful to determine if the samples are spread widely across the range of observations or concentrated near the mid-point. In NASIS speak these values might equate to the low (L) and high (H) values for numeric and integer data.
Variance is a positive value indicating deviation from the mean:
\(s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2} {n - 1}\)
This is the sum of the squared deviations from the mean, divided by the number of samples minus 1 (the numerator is commonly referred to as the sum of squares). As the deviations increase, the variance increases; conversely, if there is no deviation, the variance equals 0. As a squared value, the variance is always positive. Variance is an important component of many statistical analyses, including the most commonly cited measure of dispersion, the standard deviation. Variance for the sample dataset is:
var(h$clay, na.rm=TRUE)
## [1] 64.89187
Standard Deviation is the square root of the variance:
\(s = \sqrt\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2} {n - 1}\)
The units of the standard deviation are the same as the units measured. From the formula you can see that the standard deviation is simply the square root of the variance. Standard deviation for the sample dataset is:
sd(h$clay, na.rm = TRUE)
## [1] 8.055549
# or
# sqrt(var(clay))
Coefficient of Variation (CV) is a relative (i.e., unitless) measure of standard deviation:
\(CV = \frac{s}{\bar{x}} \times 100\)
The CV is calculated by dividing the standard deviation by the mean and multiplying by 100. Because the standard deviation varies in magnitude with the value of the mean, the CV is useful for comparing relative variation among different datasets. However, Webster (2001) discourages using the CV to compare different variables, and stresses that it is reserved for variables that have an absolute zero, like clay content. The CV may be calculated for the sample dataset as:
cv <- sd(clay) / mean(clay) * 100
cv
## [1] 34.03087
Quantiles (a.k.a. percentiles) - the nth percentile is the value below which n percent of the data values fall, when the values are sorted in ascending order.
By default, the quantile() function returns the minimum, 25th percentile, median (50th percentile), 75th percentile, and maximum, known as the five number summary originally proposed by Tukey. Other probabilities, however, can be used. At present the 5th, 50th, and 95th percentiles are being proposed for determining the range in characteristics (RIC) for a given soil property.
quantile(clay)
## 0% 25% 50% 75% 100%
## 10 18 22 28 60
# or
quantile(clay, c(0.05, 0.5, 0.95))
## 5% 50% 95%
## 13.0 22.0 38.4
Thus, for the five number summary 25% of the observations fall between each of the intervals. Quantiles are a useful metric because they are largely unaffected by the distribution of the data, and have a simple interpretation.
Range is the difference between the highest and lowest measurement of a group. Using the sample data it may be determined as:
range(clay)
## [1] 10 60
which returns the minimum and maximum values observed, or:
diff(range(clay))
## [1] 50
# or
# max(clay) - min(clay)
which returns the value of the range.
Interquartile Range (IQR) is the range from the upper (75%) quartile to the lower (25%) quartile. This represents 50% of the observations occurring in the mid-range of a sample. IQR is a robust measure of dispersion, unaffected by the distribution of data. In soil survey lingo you could consider the IQR to estimate the central concept of a soil property. IQR may be calculated for the sample dataset as:
IQR(clay)
## [1] 10
# or
# diff(quantile(clay, c(0.25, 0.75)))
A correlation matrix is a table of the correlation coefficients calculated for all pairs of variables, providing a quantitative measure to guide the decision-making process. The following will produce a correlation matrix for several numeric variables in the loafercreek dataset:
h$hzdepm <- (h$hzdepb + h$hzdept) / 2 # Compute the middle horizon depth
vars <- c("hzdepm", "clay", "sand", "total_frags_pct", "phfield")
round(cor(h[, vars], use = "complete.obs"), 2)
## hzdepm clay sand total_frags_pct phfield
## hzdepm 1.00 0.59 -0.08 0.50 -0.03
## clay 0.59 1.00 -0.17 0.28 0.13
## sand -0.08 -0.17 1.00 -0.05 0.12
## total_frags_pct 0.50 0.28 -0.05 1.00 -0.16
## phfield -0.03 0.13 0.12 -0.16 1.00
As seen in the output, each variable is perfectly correlated with itself, with a correlation coefficient of 1.00. Negative values indicate an inverse relationship between variables. What is considered highly correlated? A good rule of thumb is that an absolute value of 0.7 or greater indicates a strong correlation.
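As a sketch of how this rule of thumb could be applied programmatically (using the built-in mtcars data rather than loafercreek, so the example is self-contained):

```r
# flag variable pairs whose absolute correlation is 0.7 or greater
m <- cor(mtcars[, c("mpg", "disp", "hp", "wt")])
idx <- which(abs(m) >= 0.7 & upper.tri(m), arr.ind = TRUE)
data.frame(var1 = rownames(m)[idx[, 1]],
           var2 = colnames(m)[idx[, 2]],
           r    = round(m[idx], 2))
```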
Now that we’ve checked for missing values and typos and made corrections, we can graphically examine the distribution of our sample data. Frequency distributions are useful because they help us visualize the center (e.g., RV) and spread or dispersion (e.g., low and high) of our data. Typically, introductory statistics emphasizes the normal (i.e., Gaussian) distribution.
| Plot Types | Description |
|---|---|
| Bar | a plot where each bar represents the frequency of observations for a ‘group’ |
| Histogram | a plot where each bar represents the frequency of observations for a ‘given range of values’ |
| Density | an estimation of the frequency distribution based on the sample data |
| Quantile-Quantile | a plot of the actual data values against a normal distribution |
| Box-Whisker | a visual representation of median, quartiles, symmetry, skewness, and outliers |
| Scatter & Line | a graphical display of one variable plotted on the x axis and another on the y axis |
| Plot Types | Base R | lattice | ggplot geoms |
|---|---|---|---|
| Bar | barplot() | barchart() | geom_bar() |
| Histogram | hist() | histogram() | geom_histogram() |
| Density | plot(density()) | densityplot() | geom_density() |
| Quantile-Quantile | qqnorm() | qq() | geom_qq() |
| Box-Whisker | boxplot() | bwplot() | geom_boxplot() |
| Scatter & Line | plot() | xyplot() | geom_point() |
A bar plot is a graphical display of the frequency (i.e., the number of observations, count or n) falling within each class of a categorical variable, such as soil texture. It is a graphical alternative to the table() function.
library(ggplot2)
# bar plot
ggplot(h, aes(x = texcl)) +
geom_bar()
A histogram is similar to a bar plot, except that instead of summarizing categorical data, it bins a continuous variable like clay content into non-overlapping intervals for the sake of display. The number of intervals can be specified by the user or determined automatically using an algorithm such as nclass.Sturges(). Because histograms are sensitive to the number of bins, they are not the best method for judging the shape of a distribution for small datasets.
ggplot(h, aes(x = clay)) +
geom_histogram(bins = nclass.Sturges(h$clay))
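nclass.Sturges() is a simple rule: the bin count grows with log2 of the sample size, i.e., ceiling(log2(n) + 1). A quick check:

```r
# Sturges' rule: number of bins from the sample size
n <- 100
ceiling(log2(n) + 1)      # 8
nclass.Sturges(rnorm(n))  # also 8; only the length of the vector matters
```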
A density estimation, also known as a kernel density plot, generally provides a better visualization of the shape of the distribution than a histogram. Whereas the y-axis of a histogram shows the number or percent (i.e., frequency) of observations, the y-axis of a density plot represents the probability density of observing any given value, such that the area under the curve equals one. One curious feature of this density curve is the hint of two peaks (i.e., a bimodal distribution?). Given that our sample includes a mixture of surface and subsurface horizons, we may have two different populations. However, considering how much the two distributions overlap, it seems impractical to separate them in this instance.
ggplot(h, aes(x = clay)) +
geom_density()
Box plots are a graphical representation of the five number summary, depicting the quartiles (i.e., the 25%, 50%, and 75% quantiles), minimum, maximum, and outliers (if present). Box plots convey the shape of the data distribution and the presence of extreme values, and allow comparison between variables on the same scale, making them an excellent tool for screening data, determining thresholds for variables, and developing working hypotheses.
The parts of the box plot are shown in the figure below. The “box” is bounded by the 1st quartile (Q1 in the figure) and the 3rd quartile (Q3 in the figure). The median, or 2nd quartile, is the dark line in the box. The whiskers typically extend to the most extreme data points within 1.5 * IQR of the 1st and 3rd quartiles. Any data point beyond a whisker is considered an outlier.
That is not to say the outlier points are in error, just that they are extreme compared to the rest of the dataset. However, you may want to evaluate these points to ensure that they are correct.
Boxplot description (Seltman, 2009)
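The whisker rule can be verified by hand; a minimal sketch with made-up clay values:

```r
# values beyond the 1.5 * IQR 'fences' are drawn as outlier points
x <- c(10, 18, 22, 28, 60)
q <- quantile(x, c(0.25, 0.75))         # 18 and 28
fence <- 1.5 * IQR(x)                   # 1.5 * 10 = 15
x[x < q[1] - fence | x > q[2] + fence]  # 60 falls above the upper fence
```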
ggplot(h, (aes(x = genhz, y = clay))) +
geom_boxplot()
The above box plot shows a steady increase in clay content with depth. Notice the outliers in the box plots, identified by the individual circles.
A QQ plot is a plot of the actual data values against the quantiles of a normal distribution (which has a mean of 0 and standard deviation of 1).
# QQ Plot for Clay
ggplot(h, aes(sample = clay)) +
geom_qq() +
geom_qq_line()
# QQ Plot for Frags
ggplot(h, aes(sample = total_frags_pct)) +
geom_qq() +
geom_qq_line()
If the dataset is perfectly symmetric (i.e., normal), the data points will form a straight line. Overall, this plot shows that our clay example is more or less symmetric. The second plot, however, shows that our rock fragments are far from normally distributed.
A more detailed explanation of QQ plots may be found on Wikipedia:
https://en.wikipedia.org/wiki/QQ_plot
What is a normal distribution and why should you care? Many statistical methods are based on the properties of a normal distribution, and applying them to data that are not normally distributed can give misleading or incorrect results. Most methods that assume normality are robust enough for all but the most abnormal data. This section is not meant to be a recipe for decision making, but rather an extension of the tools available to help you examine your data and proceed accordingly. The impact of normality is most commonly seen in the parameters pedologists use to document the range of a variable (i.e., low, RV, and high values). Often a rule of thumb such as “two standard deviations” is used to define the low and high values of a variable. This is fine if the data are normally distributed; however, if the data are skewed, the standard deviation provides a poor description of the data’s distribution. The quantitative measures of kurtosis (peakedness) and skewness (symmetry) can be used to assist in assessing normality, and can be found in the fBasics package, although Webster (2001) cautions against using significance tests for assessing normality. The following sections and chapters will demonstrate various methods to cope with alternative distributions.
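As a sketch of what skewness measures (the fBasics functions are more full-featured; the helper below, a hypothetical skewness(), uses the simple moment formula with the sample standard deviation):

```r
# sample skewness: mean of cubed deviations scaled by the cubed sd
skewness <- function(x) {
  x <- x[!is.na(x)]
  mean((x - mean(x))^3) / sd(x)^3
}

skewness(c(1, 2, 3, 4, 5))   # 0: perfectly symmetric
skewness(c(1, 1, 2, 3, 20))  # positive: long right tail
```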
A Gaussian distribution is often referred to as a “bell curve” and has the following properties (Lane):

- it is symmetric around its mean
- its mean, median, and mode are equal
- it is defined entirely by its mean and standard deviation
- the total area under the curve equals 1
Viewing a histogram or density plot of your data provides a quick visual reference for judging normality. Distributions are typically normal, bimodal, or skewed:
Examples of different types of distributions
Occasionally distributions are uniform, or nearly so:
With the loafercreek dataset, the mean and median for clay were only slightly different, so we can reasonably assume an approximately normal distribution. However, many soil variables have non-normal distributions. For example, let’s graphically examine the mean vs. median for clay and rock fragments:
The solid lines represent the breakpoints for the mean and standard deviations; the dashed lines represent the median and quantiles. The median is a more robust measure of central tendency than the mean. In order for the mean to be a useful measure, the data distribution must be approximately normal; the further the data depart from normality, the less meaningful the mean becomes. The median always represents the same thing, independent of the data distribution: 50% of the samples fall below it and 50% above. The clay example again indicates an approximately normal distribution, but for rock fragments we see a long-tailed (i.e., skewed) distribution. Using the mean in this instance would overestimate the typical rock fragment content, although here the difference between the mean and median is only 8 percent.
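This robustness is easy to demonstrate with a small made-up sample shaped like the rock fragment data (mostly zeros with a long right tail):

```r
# a long right tail pulls the mean upward; the median is unaffected
frags <- c(0, 0, 0, 0, 5, 10, 20, 35, 95)
mean(frags)    # about 18, inflated by the extreme value
median(frags)  # 5
```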
Plotting one ratio or interval variable against another produces a scatter plot. Plots can be produced for a single pair or multiple pairs of variables. Many independent variables are often under consideration in soil survey work. This is especially common when GIS is used, which offers the potential to correlate soil attributes with a large variety of raster datasets.
The purpose of a scatter plot is to see how one variable relates to another. With modeling in general, the goal is parsimony (i.e., simplicity): determining the fewest variables required to explain or describe a relationship. If two variables explain the same thing, i.e., they are highly correlated, only one is needed. The scatter plot provides a perfect visual reference for this.
Create a basic scatter plot using the loafercreek dataset.
# scatter plot
ggplot(h, aes(x = clay, y = hzdepm)) +
geom_point() +
ylim(100, 0)
# line plot
ggplot(h, aes(y = clay, x = hzdepm, group = peiid)) +
geom_line() +
coord_flip() +
xlim(100, 0)
This plots clay on the x axis and depth on the y axis. As shown in the scatter plot above, there is a moderate correlation between these variables.
The function below produces a scatter plot matrix for all the numeric variables in the dataset. This is a good function for getting a rough sense of the linear correlations among continuous variables.
# Load the GGally package
library(GGally)
# Create a scatter plot matrix
vars <- c("hzdepm", "clay", "phfield", "total_frags_pct")
ggpairs(h[vars])
# scatter plot
ggplot(h, aes(x = clay, y = hzdepm, color = genhz)) +
geom_point(size = 3) +
ylim(100, 0)
# density plot
ggplot(h, aes(x = clay, color = genhz)) +
geom_density(size = 2)
# bar plot
ggplot(h, aes(x = genhz, fill = texture_class)) +
geom_bar()
# box plot
ggplot(h, aes(x = genhz, y = clay)) +
geom_boxplot()
# heat map (pseudo bar plot)
s <- site(loafercreek)
ggplot(s, aes(x = landform_string, y = pmkind)) +
geom_tile(alpha = 0.2)
# convert to long format
df <- reshape2::melt(h,
id.vars = c("peiid", "genhz", "hzdepm"),
measure.vars = c("clay", "phfield", "total_frags_pct")
)
ggplot(df, aes(x = genhz, y = value)) +
geom_boxplot() +
xlab("genhz") +
facet_wrap(~ variable, scales = "free_y")
library(aqp)
s <- slice(loafercreek, 0:100 ~ clay + phfield + total_frags_pct)
s <- slab(s, fm = ~ clay + phfield + total_frags_pct,
slab.fun = function(x) quantile(x, c(0.1, 0.5, 0.9), na.rm = TRUE)
)
ggplot(s, aes(x = top, y = X50.)) +
geom_line() +
geom_ribbon(aes(ymin = X10., ymax = X90., x = top), alpha = 0.2) +
xlim(c(100, 0)) +
coord_flip() +
facet_wrap(~ variable, scales = "free_x")
Slope aspect and pH are two common variables warranting special consideration for pedologists.
Since pH is measured on a logarithmic scale, the median and quantile ranges are the preferred measures when summarizing pH. Remember, pH values of 6 and 5 correspond to hydrogen ion concentrations of 0.000001 and 0.00001, respectively. Their actual average is 5.26, i.e., -log10((0.000001 + 0.00001) / 2). If no conversions are made, the mean and sd computed from pH values are effectively the geometric mean and sd of the concentrations, not the arithmetic. The wider the pH range, the greater the difference between the geometric and arithmetic means. The difference between the correct average of 5.26 and the incorrect 5.5 is small, but proper handling of data types is a best practice.
If you have a table of pH values and wish to calculate the arithmetic mean of the hydrogen ion concentrations, expressed back as pH, this example will work:
# arithmetic mean of H+ concentration, back-transformed to pH
-log10(mean(10^-h$phfield, na.rm = TRUE))
# geometric mean
mean(h$phfield, na.rm = TRUE)
## [1] 6.18
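The worked example from the text (pH values of 5 and 6) can be checked directly:

```r
# arithmetic mean of the H+ concentrations, back-transformed to pH
ph <- c(5, 6)
-log10(mean(10^-ph))  # 5.26 when rounded to two decimals
mean(ph)              # 5.5, i.e., the geometric mean of the concentrations
```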
Slope aspect - requires the use of circular statistics for numeric summaries, or graphical interpretation using circular plots. For example, if the soil map units being summarized have a uniform distribution of slope aspects ranging from 335 to 25 degrees (i.e., straddling north), the Zonal Statistics tool in ArcGIS would return a mean of 180 degrees: due south, the opposite of the true average direction.
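The failure of the arithmetic mean on circular data is easy to reproduce; the sketch below averages unit vectors instead, which is the same idea the circular package implements:

```r
# aspects straddling north: arithmetic vs. vector (circular) mean
asp <- c(335, 25)
mean(asp)  # 180: due south, the opposite of the true average direction

rad <- asp * pi / 180
circ <- atan2(mean(sin(rad)), mean(cos(rad))) * 180 / pi
circ %% 360  # ~0 (or equivalently ~360): due north, as expected
```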
The most intuitive means available for evaluating and describing slope aspect are circular plots, available in the circular package in R, and the radial plot option in the TEUI Toolkit. The circular package will also calculate circular statistics such as the mean, median, and quartiles.
library(circular)
# Extract the site table
s <- site(loafercreek)
aspect <- s$aspect_field
aspect <- circular(aspect, template="geographic", units="degrees", modulo="2pi")
summary(aspect)
## n Min. 1st Qu. Median Mean 3rd Qu. Max. Rho
## 101.0000 12.0000 255.0000 195.0000 199.5000 115.0000 20.0000 0.1772
## NA's
## 5.0000
The numeric output is fine, but the following graphic is more revealing, showing the dominant southwest slope aspect.
rose.diag(aspect, bins = 8, col="grey")
If you haven’t yet done so, please set up the sample datasets for this module.
# store path as a variable, in case you want to keep it somewhere else
ch2b.data.path <- 'C:/workspace/chapter-2b'
# make a place to store chapter 2b example data
dir.create(ch2b.data.path, recursive = TRUE)
# download example data from github
# polygons
download.file('https://github.com/ncss-tech/stats_for_soil_survey/raw/master/data/chapter_2b-spatial-data/chapter-2b-mu-polygons.zip', paste0(ch2b.data.path, '/chapter-2b-mu-polygons.zip'))
# raster data
download.file('https://github.com/ncss-tech/stats_for_soil_survey/raw/master/data/chapter_2b-spatial-data/chapter-2b-PRISM.zip', paste0(ch2b.data.path, '/chapter-2b-PRISM.zip'))
# unzip
unzip(paste0(ch2b.data.path, '/chapter-2b-mu-polygons.zip'), exdir = ch2b.data.path, overwrite = TRUE)
unzip(paste0(ch2b.data.path, '/chapter-2b-PRISM.zip'), exdir = ch2b.data.path, overwrite = TRUE)
Time to load some additional R packages for handling spatial data.
library(aqp)
library(sp)
library(raster)
library(rgdal)
library(soilDB)
We will be using polygons associated with MLRAs 15 and 18 as part of this demonstration. Import these data now with readOGR(); recall the somewhat strange syntax.
# establish path to example data
ch2b.data.path <- 'C:/workspace/chapter-2b'
# load MLRA polygons
mlra <- readOGR(dsn=ch2b.data.path, layer='mlra-18-15-AEA')
Next, load in the example raster data with the raster() function. These are 800m PRISM data that have been cropped to the extent of the MLRA 15 and 18 polygons.
# mean annual air temperature, Deg C
maat <- raster(paste0(ch2b.data.path, '/MAAT.tif'))
# mean annual precipitation, mm
map <- raster(paste0(ch2b.data.path, '/MAP.tif'))
# frost-free days
ffd <- raster(paste0(ch2b.data.path, '/FFD.tif'))
# growing degree days
gdd <- raster(paste0(ch2b.data.path, '/GDD.tif'))
# percent of annual PPT as rain
rain_fraction <- raster(paste0(ch2b.data.path, '/rain_fraction.tif'))
# annual sum of monthly PPT - ET_p
ppt_eff <- raster(paste0(ch2b.data.path, '/effective_preciptitation.tif'))
Sometimes it is convenient to “stack” raster data that share a common grid size, extent, and coordinate reference system into a single RasterStack object.
rs <- stack(maat, map, ffd, gdd, rain_fraction, ppt_eff)
# reset layer names
names(rs) <- c('MAAT', 'MAP', 'FFD', 'GDD', 'rain.fraction', 'eff.PPT')
The seriesExtent() function from the soilDB package provides a simple interface to Series Extent Explorer data files. Note that these series extents have been generalized for rapid display at regional to continental scales. A more precise representation of “series extent” can be generated from SSURGO polygons and queried from SDA.
Get an approximate extent for the Amador soil series from SEE. See the manual page for seriesExtent for additional options and related functions.
amador <- seriesExtent(s = 'amador')
class(amador)
## [1] "SpatialPolygonsDataFrame"
## attr(,"package")
## [1] "sp"
Generate 100 sampling points within the extent, using a hexagonal grid. These will be used to extract raster values from our RasterStack of PRISM data.
s <- spsample(amador, n = 100, type = 'hexagonal')
For comparison, extract a single point from each SSURGO map unit delineation that contains Amador as a major component. This will require a query to SDA for the set of matching map unit keys (mukey), followed by a second request to SDA for the geometry. The SDA_query function sends arbitrary SQL queries to SDA; the result may be a data.frame or a list, depending on the complexity of the query. The fetchSDA_spatial function returns map unit geometry as polygons, polygon envelopes, or a single point per polygon, selected by mukey or nationalmusym.
# result is a data.frame
mukeys <- SDA_query("
SELECT DISTINCT mukey
FROM component
WHERE compname = 'Amador'
AND majcompflag = 'Yes' ;"
)
# result is a SpatialPointsDataFrame
amador.pts <- fetchSDA_spatial(mukeys$mukey, by.col = 'mukey', method = 'point', chunk.size = 2)
Graphically check both methods.
# adjust margins and setup plot device for two columns
par(mar=c(1,1,3,1), mfcol=c(1,2))
# first figure
plot(maat, ext=extent(s), main='PRISM MAAT\n100 Sampling Points from Extent', axes=FALSE)
plot(amador, add=TRUE)
points(s, cex=0.25)
# second figure
plot(maat, ext=extent(s), main='PRISM MAAT\nPolygon Centroids', axes=FALSE)
points(amador.pts, cex=0.25)
Extract PRISM data (the RasterStack object we made earlier) at the sampling locations (the 100 regularly spaced points and the MU polygon centroid points) and summarize. Note that CRS transformations are automatic (when possible).
# return the result as a data.frame object
e <- extract(rs, s, df=TRUE)
e.pts <- extract(rs, amador.pts, df=TRUE)
# check out the extracted data
summary(e[, -1])
## MAAT MAP FFD GDD
## Min. :16.52 Min. :324.0 Min. :311.0 Min. :2591
## 1st Qu.:16.61 1st Qu.:428.2 1st Qu.:324.0 1st Qu.:2629
## Median :16.65 Median :482.0 Median :330.0 Median :2644
## Mean :16.65 Mean :476.3 Mean :328.6 Mean :2645
## 3rd Qu.:16.69 3rd Qu.:527.5 3rd Qu.:334.0 3rd Qu.:2665
## Max. :16.86 Max. :611.0 Max. :340.0 Max. :2714
## rain.fraction eff.PPT
## Min. : 99.00 Min. :-567.4
## 1st Qu.: 99.00 1st Qu.:-428.2
## Median : 99.00 Median :-363.2
## Mean : 99.02 Mean :-373.3
## 3rd Qu.: 99.00 3rd Qu.:-306.9
## Max. :100.00 Max. :-213.3
# all pair-wise correlations
knitr::kable(cor(e[, -1]), digits = 2)
| MAAT | MAP | FFD | GDD | rain.fraction | eff.PPT | |
|---|---|---|---|---|---|---|
| MAAT | 1.00 | -0.53 | -0.17 | 0.75 | 0.21 | -0.52 |
| MAP | -0.53 | 1.00 | 0.69 | -0.90 | -0.17 | 0.99 |
| FFD | -0.17 | 0.69 | 1.00 | -0.65 | -0.12 | 0.72 |
| GDD | 0.75 | -0.90 | -0.65 | 1.00 | 0.18 | -0.89 |
| rain.fraction | 0.21 | -0.17 | -0.12 | 0.18 | 1.00 | -0.18 |
| eff.PPT | -0.52 | 0.99 | 0.72 | -0.89 | -0.18 | 1.00 |
Quickly compare the two sets of samples. More on this in the Sampling module.
# compile results into a list
maat.comparison <- list(
'regular samples'=e$MAAT,
'polygon centroids'=e.pts$MAAT
)
# number of samples per method
lapply(maat.comparison, length)
## $`regular samples`
## [1] 90
##
## $`polygon centroids`
## [1] 779
# summary() applied by group
lapply(maat.comparison, summary)
## $`regular samples`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 16.52 16.61 16.65 16.65 16.69 16.86
##
## $`polygon centroids`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 16.18 16.61 16.65 16.66 16.72 17.10
# box-whisker plot
par(mar=c(4.5, 8, 3, 1), mfcol=c(1,1))
boxplot(maat.comparison, horizontal=TRUE, las=1, xlab='MAAT (deg C)',
varwidth=TRUE, boxwex=0.5,
main='MAAT Comparison')
Basic climate summaries from a standardized source (e.g. PRISM) might be a useful addition to an OSD. Think about how you could adapt this example to compare climate summaries derived from NASIS pedons to similar summaries derived from map unit polygons and generalized soil series extents.
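A rough sketch of that adaptation might look like the following. The object names and MAAT values here are invented placeholders for illustration; in practice the two vectors would come from extract() calls at NASIS pedon locations and at map unit polygon sample points.

```r
# hypothetical MAAT values (deg C); in practice, something like
# extract(maat, pedon.locations) and extract(maat, polygon.samples)
maat.pedons   <- c(16.4, 16.6, 16.7, 16.5, 16.8)
maat.polygons <- c(16.2, 16.6, 16.9, 16.7, 16.5, 17.0)

# re-use the list + lapply(summary) pattern from above
cmp <- list('NASIS pedons' = maat.pedons, 'MU polygon samples' = maat.polygons)
lapply(cmp, summary)

# side-by-side box-whisker comparison
boxplot(cmp, horizontal = TRUE, las = 1, xlab = 'MAAT (deg C)')
```

The same pattern extends to any of the other PRISM layers in the RasterStack.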
The following example is a simplified version of what is available in the soilReports package, in reports on the ncss-tech GitHub repository, and in the TEUI suite of map unit summary tools. The Computing GIS Summaries from Map Unit Polygons tutorial is an expanded version of this example.
Example output from the soilReports package:
Efficient summary of large raster data sources can be accomplished using:
Back to our example data. The first step is to check the MLRA polygons (mlra); how many features per MLRA symbol? Note that some MLRA have more than one polygon.
table(mlra$MLRARSYM)
##
## 15 18
## 11 1
Convert polygon area from square meters to acres and summarize. Note that this will only make sense when using a projected CRS with units of meters!
poly.area <- round(sapply(mlra@polygons, slot, 'area') * 0.000247105)
summary(poly.area)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 43 26870 1392380 956044 8398251
sum(poly.area)
## [1] 16708561
Sample each polygon at a constant sampling density of 0.001 samples per acre (1 sample for every 1,000 ac.). At this sampling density we should expect approximately 16,700 samples–more than enough for our simple example.
library(sharpshootR)
# the next function requires a polygon ID: each polygon gets a unique number 1--number of polygons
mlra$pID <- 1:nrow(mlra)
s <- constantDensitySampling(mlra, n.pts.per.ac=0.001)
Extract MLRA symbol at sample points using the over() function. The result will be a data.frame object with all attributes from our MLRA polygons that intersect sampling points s.
# spatial overlay: sampling points and MLRA polygons
res <- over(s, mlra)
# row / feature order is preserved, so we can directly copy
s$mlra <- res$MLRARSYM
# tabulate number of samples per MLRA
table(s$mlra)
##
## 15 18
## 11521 5151
Extract values from the RasterStack of PRISM data as a data.frame.
# raster stack extraction at sampling points
e <- extract(rs, s, df=TRUE)
# convert sampling points from SpatialPointsDataFrame to data.frame
s.df <- as(s, 'data.frame')
# join columns from extracted values and sampling points
s.df <- cbind(s.df, e)
# check results
head(s.df)
## pID mlra x1 x2 ID MAAT MAP FFD GDD rain.fraction eff.PPT
## 1 1 15 -2244981 1979490 1 15.51883 505 339 2351 100 -268.1914
## 2 1 15 -2242969 1979490 2 15.51962 496 334 2355 99 -272.2041
## 3 1 15 -2246993 1981502 3 15.34560 522 340 2312 100 -243.1435
## 4 1 15 -2244981 1981502 4 15.21670 542 337 2282 99 -217.1987
## 5 1 15 -2242969 1981502 5 15.56846 531 331 2374 100 -246.6457
## 6 1 15 -2251018 1983514 6 14.93303 537 339 2191 100 -210.8704
Summarizing multivariate data by group (MLRA) is usually much simpler after reshaping data from “wide” to “long” format.
library(lattice)
library(reshape2)
# reshape from wide to long format
m <- melt(s.df, id.vars = c('mlra'), measure.vars = c('MAAT', 'MAP', 'FFD', 'GDD', 'rain.fraction', 'eff.PPT'))
# check "long" format
head(m)
## mlra variable value
## 1 15 MAAT 15.51883
## 2 15 MAAT 15.51962
## 3 15 MAAT 15.34560
## 4 15 MAAT 15.21670
## 5 15 MAAT 15.56846
## 6 15 MAAT 14.93303
A simple tabular summary of means by MLRA and PRISM variable using tapply().
# tabular summary of mean values
tapply(m$value, list(m$mlra, m$variable), mean)
## MAAT MAP FFD GDD rain.fraction eff.PPT
## 15 15.25793 586.7663 284.1521 2390.701 98.62373 -198.3553
## 18 15.66721 631.2677 273.2664 2497.012 97.24617 -193.2831
Lattice graphics are useful for summarizing grouped comparisons. The syntax is a little rough, but there is plenty of documentation online. Box and whisker plots are usually the first step in a graphical comparison of samples. We will go over the interpretation of these kinds of figures in chapter 4.
tps <- list(box.rectangle=list(col='black'), box.umbrella=list(col='black', lty=1), box.dot=list(cex=0.75), plot.symbol=list(col=rgb(0.1, 0.1, 0.1, alpha = 0.25, maxColorValue = 1), cex=0.25))
bwplot(mlra ~ value | variable, data=m, # setup plot and data source
as.table=TRUE, # start panels in top/left corner
varwidth=TRUE, # scale width of box by number of obs
scales=list(alternating=3, relation='free'), # setup scales
strip=strip.custom(bg=grey(0.9)), # styling for strips
par.settings=tps, # apply box/line/point styling
panel=function(...) { # within each panel, do the following
panel.grid(-1, -1) # make grid lines at all tick marks
panel.bwplot(...) # make box-whisker plot
}
)
Static figures are a good start, but sometimes an interactive summary is more appropriate for EDA. The histboxp function from the Hmisc package creates interactive (HTML) graphics that can be used in RStudio or embedded in R Markdown documents (e.g. these notes or soilReports). Use the mouse to hover over, click, drag, etc. to interact with the data. Double-click to reset the plot.
library(Hmisc)
# interactive exploration of the MAAT distributions by MLRA
histboxp(x=s.df$MAAT, group=s.df$mlra, bins=100, xlab='Mean Annual Air Temperature (deg C)')
# interactive exploration of the Effective PPT distributions by MLRA
histboxp(x=s.df$eff.PPT, group=s.df$mlra, bins=100, xlab='Annual Sum of Monthly PPT - PET (mm)')
One of the strengths of NASIS is that it has a wealth of existing queries and reports for accessing the complex data gathered and created during the Soil Survey effort. This makes it easy for the average user to load their data and run numerous reports.
The NCSS-tech R package soilDB similarly has many functions to load data from a variety of sources, NASIS included. Recently, the soilReports package was developed to replicate and expand on NASIS’s report repository.
soilReports contains a small collection of reports, mostly designed to summarize field and laboratory pedon data, as well as map unit / LRU / MLRA polygons. This collection of reports automates many of the methods we have reviewed so far, but it is no substitute for interacting with your data and asking questions.
Example report output can be found at the following link: https://github.com/ncss-tech/soilReports#example-output.
Detailed instructions are provided for each report: https://github.com/ncss-tech/soilReports#choose-an-available-report
Running and interpreting one of these reports would be a fine class project.
You can convert any of your R-based analyses to the soilReports format. The soilReports R package is essentially just a collection of R Markdown (.Rmd) reports, each with a “manifest” that describes any required dependencies, configuration files or inputs.
R Markdown is a markup format for creating reproducible, dynamic reports with R. It allows R code and text to be mingled in the same document and executed like an R script. This allows R to generate reports similar to, and even more powerful than, NASIS.
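For orientation, a minimal R Markdown file looks something like the following. This is a generic sketch, not the source of any actual soilReports report; the title and chunk contents are arbitrary.

````
---
title: "Example Report"
output: html_document
---

Narrative text written in **markdown**, mixed with executable R code chunks:

```{r example-chunk}
# this chunk is run when the document is rendered
summary(1:10)
```
````

When the file is rendered (knit), the code chunks are executed and their output is woven into the resulting HTML document.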
Let’s demonstrate how to run an existing .Rmd to summarize laboratory data for a soil series.
Load your NASIS selected set. Run a query such as “POINT - Pedon/Site/NCSSlabdata by upedonid and Taxon Name” from the Region 11 report folder to load your selected set. Be sure to target the site, pedon, and lab layer tables. Remove from your selected set any pedons and sites you wish to exclude from the report.
Install/re-install the soilReports package. This package is updated regularly (e.g. weekly), so it should be re-installed from GitHub often.
# Install the soilReports package from GitHub
remotes::install_github("ncss-tech/soilReports", dependencies=FALSE, build=FALSE)
# Load the soilReports and rmarkdown package
library(soilReports)
library(rmarkdown)
# List reports
listReports()
# Copy report to your directory
copyReport(reportName = "region11/lab_summary_by_taxonname", outputDir = "C:/workspace2/lab_sum")
Open report.Rmd. Notice there are several other support files. The parameters for the report are contained in the config.R file. If none exists, see the following job aid on how to create one: Assigning Generalized Horizon Labels.
Pay special attention to how the caret (^) and dollar sign ($) symbols are used in REGEX. They function as anchors for the beginning and end of the string, respectively.
A ^ placed before an A horizon, ^A, will match any horizon designation that starts with A, such as Ap, Ap1, but not something merely containing A, such as BA.
Placing a $ after a Bt horizon, Bt$, will match any horizon designation that ends with Bt, such as 2Bt or 3Bt, but not something with a vertical subdivision, such as Bt2.
Wrapping a pattern with both ^ and $ will result only in exact matches, i.e. strings that start and end with the contents between ^ and $. For example, ^[AC]$ will only match A or C, not Ap, Ap2, or Cg.
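These anchor behaviors are easy to check directly with grepl(). The horizon designations below are arbitrary examples chosen to exercise each pattern:

```r
# a few example horizon designations (arbitrary, for illustration)
hz <- c("A", "Ap", "BA", "Bt2", "2Bt", "C")

# ^A: starts with "A" -> matches A and Ap, but not BA
grepl("^A", hz)
## [1]  TRUE  TRUE FALSE FALSE FALSE FALSE

# Bt$: ends with "Bt" -> matches 2Bt, but not Bt2
grepl("Bt$", hz)
## [1] FALSE FALSE FALSE FALSE  TRUE FALSE

# ^[AC]$: exactly "A" or "C"
grepl("^[AC]$", hz)
## [1]  TRUE FALSE FALSE FALSE FALSE  TRUE
```

Testing candidate patterns against a vector of your actual horizon designations, as above, is a quick way to debug genhz_rules before running the report.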
Command-line approach
# Set report parameters
series <- "Miami"
genhz_rules <- "C:/workspace2/lab_sum/Miami_rules.R"
# report file path
report_path <- "C:/workspace2/lab_sum/report.Rmd"
# Run the report
render(input = report_path,
output_dir = "C:/workspace2",
output_file = "C:/workspace2/lab_sum.html",
envir = new.env(),
params = list(series = series,
genhz_rules = genhz_rules
)
)
Manual approach
Open the report.Rmd, click the Knit drop-down arrow, and select Knit with Parameters.
Also, beware when opening the .html file with Internet Explorer – be sure to click on “Allow blocked content” if prompted. Otherwise, Internet Explorer will alter the formatting of tables etc. within the document.
Sample pedon report
Run the lab_summary_by_taxonname.Rmd report on a soil series of your choice.
Another popular report in soilReports is the region2/mu-comparison report.
This report uses constant density sampling (sharpshootR::constantDensitySampling()) to extract numeric and categorical values from multiple raster data sources that overlap a set of user-supplied polygons.
In this example, we clip a small portion of SSURGO polygons from the CA630 soil survey area extent. We then select a small set of mapunit symbols (5012, 3046, 7083, 7085, 7088) that occur within the clipping extent. These mapunits have soil forming factors we expect to contrast with one another in several ways. You can inspect other mapunit symbols by changing mu.set in config.R.
# set up ch4 path and path for report
ch4.data.path <- "C:/workspace2/chapter4"
ch4.mucomp.path <- paste0(ch4.data.path,"/mucomp")
# create any directories that may be missing
if(!dir.exists(ch4.mucomp.path)) {
dir.create(ch4.mucomp.path, recursive = TRUE)
}
# download raster data, SSURGO clip from CA630, and sample script for clipping your own raster data
download.file('https://github.com/ncss-tech/stats_for_soil_survey/raw/master/data/chapter_4-mucomp-data/ch4_mucomp-data.zip', paste0(ch4.mucomp.path, '/chapter_4-mucomp-data.zip'))
unzip(paste0(ch4.mucomp.path, '/chapter_4-mucomp-data.zip'), exdir = ch4.mucomp.path, overwrite = TRUE)
Set up the region2/mu-comparison report with soilReports:
# create a new instance of the report
library(soilReports)
# get report dependencies
reportSetup('region2/mu-comparison')
# create report instance
copyReport('region2/mu-comparison', outputDir = ch4.mucomp.path, overwrite = TRUE)
If you want, you can now set up the default config.R that is created by copyReport() to work with your own data. OR you can use the “sample” config.R file (called new_config.R) in the ZIP file downloaded above.
Replace the default config.R with the sample config.R:
# copy config file containing relative paths to rasters downloaded above
file.copy(paste0(ch4.mucomp.path, "/new_config.R"), paste0(ch4.mucomp.path,"/config.R"), overwrite = TRUE)
Open report.Rmd in the C:/workspace2/chapter4/mucomp folder and click the “Knit” button at the top of the RStudio source pane to run the report.
Inspect the report output HTML file, as well as the spatial and tabular data output in the output folder.
Question: What are the major differences that you can see, based on the report, between the five different mapunit symbols that were analyzed?
The region2/shiny-pedon-summary report is an interactive Shiny-based report that uses flexdashboard to help the user subset and summarize NASIS pedons from a graphical interface. This allows one to rapidly generate reports from a large set of pedons in their NASIS selected set.
The left INPUT sidebar has numerous options for subsetting pedon data. Specifically, you can change REGEX patterns for mapunit symbol, taxon name, local phase, and User Pedon ID. Also, you can use the drop down boxes to filter on taxon kind or compare different “modal”/RV pedons.
Example: Analyzing the Loafercreek Taxadjuncts
Set up the region2/shiny-pedon-summary report with soilReports:
# create a new instance of the report
library(soilReports)
# set path for shiny-pedon-summary report instance
ch4.shinyped.path <- "C:/workspace2/chapter4/shiny-pedon"
# create directories (if needed)
if(!dir.exists(ch4.shinyped.path))
dir.create(ch4.shinyped.path, recursive = TRUE)
# get report dependencies
reportSetup('region2/shiny-pedon-summary')
# copy report contents to target path
copyReport('region2/shiny-pedon-summary', outputDir = ch4.shinyped.path, overwrite = TRUE)
Edit the config.R file in “C:/workspace2/chapter4/shiny-pedon” (or wherever you installed the report). To use the soilDB datasets loafercreek and gopheridge, set demo_mode <- TRUE; this is the simplest way to demonstrate how the report works. Alternately, when demo_mode <- FALSE, pedons are loaded from your NASIS selected set.
config.R also allows you to specify a shapefile for overlaying the points on – to determine mapunit symbol – as well as several raster data sources whose values will be extracted at point locations and summarized. The demo dataset does not use either of these by default, due to large file sizes.
Furthermore, a default (very general) set of REGEX generalized horizon patterns is provided to assign generalized horizon labels for provisional grouping. These provided patterns are unlikely to cover ALL cases, and essentially always need to be modified for final correlation. That said, they do a decent job of making a first-pass correlation for diverse types of soils.
The default config.R settings use the general patterns: use_regex_ghz <- TRUE. You are welcome to modify the defaults. If you want to use the values you have populated in NASIS Pedon Horizon Component Layer ID, set use_regex_ghz <- FALSE.
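Putting those two settings together, a config.R fragment for the demo data might look like this (a fragment only; the installed file contains additional options):

```r
# example settings from config.R
demo_mode <- TRUE       # use the loafercreek and gopheridge demo datasets
use_regex_ghz <- TRUE   # apply the default REGEX generalized horizon patterns
```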
Run shiny.Rmd. This report uses the Shiny flexdashboard interface. Open shiny.Rmd and click the “Run Document” button to start the report. This will load the pedon and spatial data specified in config.R.
NOTE: If a Viewer Window does not pop-up right away, click the gear icon to the right of the “Run Document” button. Be sure the option “Preview in Window” is checked, then click “Run Document” again.
All of the subsetting parameters are in the left-hand sidebar. Play around with all of these options – the graphs and plots in the tabs to the right will automatically update as you make changes.
When you like what you have, you can export a non-interactive HTML file for your records. To do this, first set the “Report name:” box to something informative – this will be your report output file name. Then scroll down to the bottom of the INPUT sidebar and click the “Export Report” button. Check the “output” folder (a subdirectory of where you installed the report) for your results.
FAO Corporate Document Repository. http://www.fao.org/docrep/field/003/AC175E/AC175E07.htm
Lane, D.M. Online Statistics Education: A Multimedia Course of Study. Rice University. http://onlinestatbook.com/
Seltman, H. 2009. Experimental Design and Analysis. Chapter 4: Exploratory Data Analysis. Carnegie Mellon University. http://www.stat.cmu.edu/~hseltman/309/Book/
Tukey, J. 1977. Exploratory Data Analysis. Addison-Wesley.
Tukey, J. 1980. We need both exploratory and confirmatory. The American Statistician, 34:1, 23-25.
Webster, R. 2001. Statistics to support soil research and their presentation. European Journal of Soil Science. 52:331-340. http://onlinelibrary.wiley.com/doi/10.1046/j.1365-2389.2001.00383.x/abstract
Healy, K., 2018. Data Visualization: A Practical Introduction. Princeton University Press. http://socviz.co/
Helsel, D.R., and R.M. Hirsch, 2002. Statistical Methods in Water Resources. Techniques of Water-Resources Investigations, Book 4, Chapter A3. U.S. Geological Survey. 522 pages. http://pubs.usgs.gov/twri/twri4a3/
Kabacoff, R.I., 2015. R in Action. Manning Publications Co. Shelter Island, NY. https://www.statmethods.net/
Kabacoff, R.I., 2018. Data Visualization in R. https://rkabacoff.github.io/datavis/
Peng, R. D., 2016. Exploratory Data Analysis with R. Leanpub. https://bookdown.org/rdpeng/exdata/
Wilke, C.O., 2019. Fundamentals of Data Visualization. O'Reilly Media, Inc. https://serialmentor.com/dataviz/